This notebook follows some of the examples in Visualization in Bayesian Workflow by Gabry, Simpson, Vehtari, Betancourt, and Gelman.

Our goal for this problem is to estimate ground-station values of \(PM_{2.5}\) (air concentrations of particulate matter of 2.5 microns or below), with uncertainty intervals, from satellite data. That is, we want to use a high-quality, reliable dataset that suffers from sparse and biased collection to calibrate a noisier dataset that we can collect anywhere. The plan:

  • Do some exploratory data analysis
  • Formulate a model
  • Evaluate how we’ve specified the model, before running inference, using samples from the prior predictive distribution
  • Run inference and inspect the MCMC results visually
  • Evaluate the model’s performance using samples from the posterior predictive distribution
  • Define a more complicated model and use these tools to compare the two
library(tidyverse)
library(rstan)
library(tidybayes)
library(knitr)

# plot themes
theme_set(theme_minimal())

# stan options
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

1. Import Data

data <- read_csv(
    "../../data/particulate_data.csv", 
    locale = locale(encoding="iso-8859-1"),
    col_types = "cccicddddc"
)

# add a column to track city id
data <- data %>% mutate(city_id = row_number())

# show data
data
## # A tibble: 2,980 x 11
##    City_locality iso3  country super_region super_region_na…  pm25 sat_2014
##    <chr>         <chr> <chr>          <int> <chr>            <dbl>    <dbl>
##  1 <NA>          AUS   Austra…            1 HighIncome        33.5    33.3 
##  2 Ílhavo        PRT   Portug…            1 HighIncome        15       9.87
##  3 Ávila         ESP   Spain              1 HighIncome        10       9.18
##  4 AACHEN        DEU   Germany            1 HighIncome        13      15.5 
##  5 AALBORG       DNK   Denmark            1 HighIncome        11       9.80
##  6 AALEN         DEU   Germany            1 HighIncome        12      19.3 
##  7 AALESUND      NOR   Norway             1 HighIncome         7.5     4.73
##  8 AARSCHOT      BEL   Belgium            1 HighIncome        15      15.8 
##  9 Aba           NGA   Nigeria            7 Sub-Saharan Afr   49      45.8 
## 10 Abakaliki     NGA   Nigeria            7 Sub-Saharan Afr   28      52.2 
## # ... with 2,970 more rows, and 4 more variables: log_pm25 <dbl>,
## #   log_sat <dbl>, cluster_region <chr>, city_id <int>
data[,8:11] # just to show the rest of the columns
## # A tibble: 2,980 x 4
##    log_pm25 log_sat cluster_region city_id
##       <dbl>   <dbl> <chr>            <int>
##  1     3.51    3.51 2                    1
##  2     2.71    2.29 2                    2
##  3     2.30    2.22 2                    3
##  4     2.56    2.74 2                    4
##  5     2.40    2.28 2                    5
##  6     2.48    2.96 2                    6
##  7     2.01    1.55 2                    7
##  8     2.71    2.76 2                    8
##  9     3.89    3.82 4                    9
## 10     3.33    3.96 4                   10
## # ... with 2,970 more rows

2. Exploratory Data Analysis

Since we’re relating two variables, expect a lot of scatter plots.

Start with the logarithm of the ground-station measurements, plotted against the logarithm of the satellite measurements:

ggplot(data, aes(x = log_sat, y = log_pm25)) +
    geom_point() +
    geom_smooth(method = "lm")

If you squint, the linear regression line seems like an OK(ish) place to start; the data clearly has an overall linear trend. The focus of the rest of this notebook will be on the extent to which deviations from this linear relationship are going to mess us up.

If we look closer, however, the errors don’t seem to be totally random (i.e. they don’t look i.i.d. normal). The data has some structure to it, with some areas tending toward positive residuals and others toward negative ones.

We can see this more easily by plotting the difference between log_pm25 and log_sat on the y-axis instead of log_pm25 by itself:

ggplot(data, aes(x = log_sat, y = log_pm25 - log_sat)) +
    geom_point() +
    geom_smooth(method = "lm") +
    geom_hline(yintercept = 0, color = "red", linetype = "dashed")

If the errors were IID we would expect them to be normally distributed around 0 (the red dashed line). But there is clearly a non-random pattern here. Let’s do some more exploration to see if we can (qualitatively) understand where this is coming from.

One possibility is that different types of locations have different relationships between satellite and ground station data. We could do a quick check by breaking out the above plot by WHO super-regions:

ggplot(data, aes(x = log_sat, y = log_pm25, color = super_region_name)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm", se = FALSE) + 
    theme(legend.position="bottom")

Notice that the different geographic areas only partially overlap, and regression lines fit to each show very different relationships. Gabry et al also tried grouping stations using hierarchical clustering (labeled cluster_region in the dataset):

ggplot(data, aes(x = log_sat, y = log_pm25, color = cluster_region)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "lm", se = FALSE) + 
    theme(legend.position="bottom")

Just as in the super-region plot, we see very different relationships here. When we start modeling, we’ll need to be on the lookout for biases in predictions for different types of locations.

3. Baseline Model Prior

To have something to benchmark against, let’s start by fitting a Bayesian linear regression,

\[ y_i \sim N(\beta_0 + \beta_1x_i, \sigma^2) \] where \(y_i\) is the ground station \(PM_{2.5}\) measurement, \(x_i\) is the corresponding satellite measurement, and \(\beta_0\), \(\beta_1\), and \(\sigma\) are the parameters we need to infer. We’ll use normal priors on \(\beta_0\) and \(\beta_1\), and a half-normal prior on \(\sigma\) (since it has to be positive).

Before fitting anything, we should do a “keep ourselves honest” check to see how well we’ve specified our model. Even without touching Stan, it’s easy to draw fake data from the prior predictive distribution:

  • draw samples \(\beta_0^\prime\), \(\beta_1^\prime\), and \(\sigma^\prime\) for each of the parameters from their prior distribution
  • for each data point \(x\), compute an estimate \(y^\prime = \beta_0^\prime + \beta_1^\prime x + \epsilon\sigma^\prime\) where \(\epsilon \sim N(0,1)\)
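
These two steps are easy to sketch in base R. A toy sketch, with log_sat simulated as a stand-in for the real data$log_sat column, using the weakly informative priors we try first below:

```r
set.seed(42)

# stand-in for data$log_sat (hypothetical values on roughly the observed range)
log_sat <- runif(100, min = 1, max = 4)

# step 1: draw one set of parameters from the (weak) priors
beta0_prime <- rnorm(1, mean = 0, sd = 100)
beta1_prime <- rnorm(1, mean = 0, sd = 100)
sigma_prime <- abs(rnorm(1, mean = 0, sd = 100))  # half-normal via |normal|

# step 2: simulate one fake dataset from those parameters
log_pm_prime <- beta0_prime + beta1_prime * log_sat +
    rnorm(length(log_sat), mean = 0, sd = sigma_prime)

length(log_pm_prime)  # one simulated measurement per observation
```

Repeating steps 1 and 2 many times gives a collection of fake datasets whose plausibility we can eyeball, which is exactly what the Stan programs below automate.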

Let’s plot some of our prior predictive datasets alongside our actual data, starting with weakly informative priors:

prior_weak_file <- "../../stan/particulate_matter_ols_prior_weak.stan"
cat(read_file(prior_weak_file))
// ols model for particulate matter
// generate from the prior
data {
  int<lower=1> N;       // number of observations
  vector[N] log_sat;    // log of satellite measurements
}
parameters {
  real beta0;           // global intercept
  real beta1;           // global slope
  real<lower=0> sigma;  // error sd for Gaussian likelihood
}
model {
  // priors
  beta0 ~ normal(0, 100);
  beta1 ~ normal(0, 100);
  sigma ~ normal(0, 100);
}
generated quantities {
  vector[N] log_pm_prior; // vector of data generated from prior

  for (i in 1:N) {
      log_pm_prior[i] = normal_rng(beta0 + beta1 * log_sat[i],sigma);
  }
}

Compile this Stan model.

prior_weak_model <- stan_model(prior_weak_file)

Now let’s sample from this model.

prior_data <- list(N = nrow(data), log_sat = data$log_sat)
prior_weak_samples <- sampling(prior_weak_model, data = prior_data, iter = 300, chains = 1)
## 
## SAMPLING FOR MODEL 'particulate_matter_ols_prior_weak' NOW (CHAIN 1).
## 
## Gradient evaluation took 5e-06 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 300 [  0%]  (Warmup)
## Iteration:  30 / 300 [ 10%]  (Warmup)
## Iteration:  60 / 300 [ 20%]  (Warmup)
## Iteration:  90 / 300 [ 30%]  (Warmup)
## Iteration: 120 / 300 [ 40%]  (Warmup)
## Iteration: 150 / 300 [ 50%]  (Warmup)
## Iteration: 151 / 300 [ 50%]  (Sampling)
## Iteration: 180 / 300 [ 60%]  (Sampling)
## Iteration: 210 / 300 [ 70%]  (Sampling)
## Iteration: 240 / 300 [ 80%]  (Sampling)
## Iteration: 270 / 300 [ 90%]  (Sampling)
## Iteration: 300 / 300 [100%]  (Sampling)
## 
##  Elapsed Time: 0.100519 seconds (Warm-up)
##                0.053731 seconds (Sampling)
##                0.15425 seconds (Total)

Extract the samples into a data frame using the tidybayes function spread_draws(). This produces some new columns that you may not have seen before:

  • .chain the chain number. Since we only ran one chain these will all be 1.
  • .iteration the iteration number within each chain.
  • .draw a unique number for each draw from the posterior.
  • city_id the subscript for the variable (in this case for log_pm_prior). Since we are estimating 2980 observations this will vary from 1 to 2980.
  • log_pm_prior the sampled value of the variable.
log_pm_weak <- prior_weak_samples %>% 
    spread_draws(log_pm_prior[city_id]) %>% 
    ungroup()

log_pm_weak
## # A tibble: 447,000 x 5
##    .chain .iteration .draw city_id log_pm_prior
##     <int>      <int> <int>   <int>        <dbl>
##  1      1          1     1       1         341.
##  2      1          1     1       2         262.
##  3      1          1     1       3         290.
##  4      1          1     1       4         328.
##  5      1          1     1       5         324.
##  6      1          1     1       6         329.
##  7      1          1     1       7         255.
##  8      1          1     1       8         339.
##  9      1          1     1       9         369.
## 10      1          1     1      10         375.
## # ... with 446,990 more rows

Note that the 447,000 rows come from 2,980 observations × 1 chain × 150 retained iterations (the first half of each chain’s 300 iterations is discarded as warmup).
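
A quick arithmetic check on that row count:

```r
n_obs    <- 2980     # observations (one per city)
n_chains <- 1        # chains we ran
n_kept   <- 300 / 2  # iterations retained after warmup
n_obs * n_chains * n_kept  # 447000
```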

Let’s take a look at 9 random draws from our estimates. We’ll also add in the original data so we can compare against it.

log_pm_weak_comparison <- log_pm_weak %>% 
    filter(.draw %in% sample(max(.draw), size = 9L)) %>% 
    left_join(data %>% select(city_id, log_sat, log_pm25), by = "city_id") %>% 
    select(.draw, city_id, log_sat, observed = log_pm25, sampled = log_pm_prior)

log_pm_weak_comparison
## # A tibble: 26,820 x 5
##    .draw city_id log_sat observed sampled
##    <int>   <int>   <dbl>    <dbl>   <dbl>
##  1     8       1    3.51     3.51   1029.
##  2     8       2    2.29     2.71    567.
##  3     8       3    2.22     2.30    450.
##  4     8       4    2.74     2.56    548.
##  5     8       5    2.28     2.40    609.
##  6     8       6    2.96     2.48    610.
##  7     8       7    1.55     2.01    429.
##  8     8       8    2.76     2.71    670.
##  9     8       9    3.82     3.89    772.
## 10     8      10    3.96     3.33    740.
## # ... with 26,810 more rows

Let’s visualize these results to see if our prior distribution is producing feasible values for log_pm25.

log_pm_weak_comparison %>% 
    gather("type", "log_pm_25", observed:sampled) %>% 
    ggplot(aes(x = log_sat, y = log_pm_25, color = type)) +
    geom_point(alpha = 0.3) + 
    facet_wrap(~ .draw)

Some of these simulations include predictions orders of magnitude away from what we actually observe (remember, we’re working on log scales here!). In some cases the slope even goes the wrong way: these priors generate data that’s decidedly unphysical. This is bad for two reasons:

  • It means the prior doesn’t really represent our prior knowledge, which defeats the entire conceptual point of this branch of statistics
  • More practically, it means that the data has to do more work than it should to constrain the prior. We’ll tend to get posteriors that overestimate uncertainty and (if we do a really bad job specifying the prior) may be biased.

So what should we be looking for? Gabry et al suggest containment: priors that keep us inside sensible parts of the parameter space.

So our priors should be broader than (and include) our data, but not so broad that they make impossible predictions. Let’s try again with the intercept constrained to be near zero, the slope biased toward small positive values (since we already know the measurements are roughly correlated), and the noise confined to near unit scale:
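
To get a feel for what containment buys us, compare the spread of the implied regression-line height at a typical satellite value under the two sets of priors. A rough base-R sketch; x_typical = 3 is my own choice, read off the scatter plots above:

```r
set.seed(1)
x_typical <- 3  # a typical log_sat value, judging from the scatter plots

# implied line height under the weak priors: normal(0, 100) on both coefficients
mu_weak  <- rnorm(1e4, 0, 100) + rnorm(1e4, 0, 100) * x_typical

# and under the containment priors: intercept near 0, slope near 1
mu_tight <- rnorm(1e4, 0, 1) + rnorm(1e4, 1, 1) * x_typical

quantile(mu_weak,  c(0.05, 0.95))  # spans on the order of a thousand log units
quantile(mu_tight, c(0.05, 0.95))  # spans roughly ten log units
```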

prior_file <- "../../stan/particulate_matter_ols_prior.stan"
cat(read_file(prior_file))
// ols model for particulate matter
// generate from the prior
data {
  int<lower=1> N;       // number of observations
  vector[N] log_sat;    // log of satellite measurements
}
parameters {
  real beta0;           // global intercept
  real beta1;           // global slope
  real<lower=0> sigma;  // error sd for Gaussian likelihood
}
model {
  // priors
  beta0 ~ normal(0,1);
  beta1 ~ normal(1,1);
  sigma ~ normal(0,1);
}
generated quantities {
  vector[N] log_pm_prior; // vector of data generated from prior

  for (i in 1:N) {
      log_pm_prior[i] = normal_rng(beta0 + beta1 * log_sat[i],sigma);
  }
}

Compile this Stan model.

prior_model <- stan_model(prior_file)

Now let’s sample from this model.

prior_samples <- sampling(prior_model, data = prior_data, iter = 300, chains = 1)
## 
## SAMPLING FOR MODEL 'particulate_matter_ols_prior' NOW (CHAIN 1).
## 
## Gradient evaluation took 5e-06 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 300 [  0%]  (Warmup)
## Iteration:  30 / 300 [ 10%]  (Warmup)
## Iteration:  60 / 300 [ 20%]  (Warmup)
## Iteration:  90 / 300 [ 30%]  (Warmup)
## Iteration: 120 / 300 [ 40%]  (Warmup)
## Iteration: 150 / 300 [ 50%]  (Warmup)
## Iteration: 151 / 300 [ 50%]  (Sampling)
## Iteration: 180 / 300 [ 60%]  (Sampling)
## Iteration: 210 / 300 [ 70%]  (Sampling)
## Iteration: 240 / 300 [ 80%]  (Sampling)
## Iteration: 270 / 300 [ 90%]  (Sampling)
## Iteration: 300 / 300 [100%]  (Sampling)
## 
##  Elapsed Time: 0.054332 seconds (Warm-up)
##                0.054696 seconds (Sampling)
##                0.109028 seconds (Total)
log_pm_prior <- prior_samples %>% 
    spread_draws(log_pm_prior[city_id]) %>% 
    ungroup()

log_pm_prior
## # A tibble: 447,000 x 5
##    .chain .iteration .draw city_id log_pm_prior
##     <int>      <int> <int>   <int>        <dbl>
##  1      1          1     1       1       -1.63 
##  2      1          1     1       2       -1.36 
##  3      1          1     1       3       -0.949
##  4      1          1     1       4       -1.57 
##  5      1          1     1       5       -1.27 
##  6      1          1     1       6       -2.06 
##  7      1          1     1       7       -0.720
##  8      1          1     1       8       -1.89 
##  9      1          1     1       9       -2.29 
## 10      1          1     1      10       -2.63 
## # ... with 446,990 more rows

Let’s take a look at 9 random draws from our estimates. We’ll also join in the original data (by city_id) so we can compare against it.

log_pm_comparison <- log_pm_prior %>% 
    filter(.draw %in% sample(max(.draw), size = 9L)) %>% 
    left_join(data %>% select(city_id, log_sat, log_pm25), by = "city_id") %>% 
    select(.draw, city_id, log_sat, observed = log_pm25, sampled = log_pm_prior)

log_pm_comparison
## # A tibble: 26,820 x 5
##    .draw city_id log_sat observed sampled
##    <int>   <int>   <dbl>    <dbl>   <dbl>
##  1     3       1    3.51     3.51    9.90
##  2     3       2    2.29     2.71    6.10
##  3     3       3    2.22     2.30    7.53
##  4     3       4    2.74     2.56    6.00
##  5     3       5    2.28     2.40    7.23
##  6     3       6    2.96     2.48    9.58
##  7     3       7    1.55     2.01    3.72
##  8     3       8    2.76     2.71    8.68
##  9     3       9    3.82     3.89    8.44
## 10     3      10    3.96     3.33    8.48
## # ... with 26,810 more rows

Let’s visualize these results to see if our prior distribution is producing feasible values for log_pm25.

log_pm_comparison %>% 
    gather("type", "log_pm_25", observed:sampled) %>% 
    ggplot(aes(x = log_sat, y = log_pm_25, color = type)) +
    geom_point(alpha = 0.3) + 
    facet_wrap(~ .draw)

Much better: these simulated datasets live on the same scale as the observations, with none of the wild excursions we saw under the weak priors.

4. Baseline Model Inference

Here’s Gabry et al’s code for the basic regression model:

model_file <- "../../stan/particulate_matter_ols.stan"
cat(read_file(model_file))
// ols model for particulate matter
data {
  int<lower=1> N;       // number of observations
  vector[N] log_sat;    // log of satellite measurements
  vector[N] log_pm;     // log of ground PM_2.5 measurements
}
parameters {
  real beta0;           // global intercept
  real beta1;           // global slope
  real<lower=0> sigma;  // error sd for Gaussian likelihood
}
model {
  // priors
  beta0 ~ normal(0,1);
  beta1 ~ normal(1,1);
  sigma ~ normal(0,1);

  // likelihood
  log_pm ~ normal(beta0 + beta1 * log_sat, sigma);
}
generated quantities {
  vector[N] log_lik;    // pointwise log-likelihood for LOO
  vector[N] log_pm_rep; // replications from posterior predictive dist

  for (n in 1:N) {
    real log_pm_hat_n = beta0 + beta1 * log_sat[n];
    log_lik[n] = normal_lpdf(log_pm[n] | log_pm_hat_n, sigma);
    log_pm_rep[n] = normal_rng(log_pm_hat_n, sigma);
  }
}

Compile this Stan model.

model <- stan_model(model_file)

Now let’s sample from this model.

model_data <- list(N = nrow(data), log_sat = data$log_sat, log_pm = data$log_pm25)
model_samples <- sampling(model, data = model_data, iter = 1000, chains = 4)

And extract the draws for \(\beta_0\), \(\beta_1\), and \(\sigma\).

coef_samples <- model_samples %>% 
    spread_draws(beta0, beta1, sigma) %>% 
    ungroup()

coef_samples
## # A tibble: 2,000 x 6
##    .chain .iteration .draw beta0 beta1 sigma
##     <int>      <int> <int> <dbl> <dbl> <dbl>
##  1      1          1     1 0.907 0.700 0.453
##  2      1          2     2 0.890 0.706 0.444
##  3      1          3     3 0.865 0.712 0.451
##  4      1          4     4 0.856 0.714 0.448
##  5      1          5     5 0.851 0.713 0.449
##  6      1          6     6 0.936 0.689 0.450
##  7      1          7     7 0.925 0.690 0.451
##  8      1          8     8 0.930 0.697 0.450
##  9      1          9     9 0.912 0.689 0.453
## 10      1         10    10 0.912 0.688 0.452
## # ... with 1,990 more rows

4.1 Trace Plots

Gabry’s paper has some slick visualizations for investigating your MCMC run that we’ll talk about in the future when we discuss Hamiltonian Monte Carlo. For now, just do some quick trace plots to ensure nothing terrible happened:

coef_samples %>% 
    gather("parameter", "value", beta0:sigma) %>% 
    ggplot(aes(x = .draw, y = value)) +
    geom_line() + 
    facet_wrap(~parameter, ncol = 1, scales = "free_y")

Looks like we’re mixing OK.
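
Trace plots are a qualitative check; the split-R-hat statistic quantifies the same idea (do different parts of a chain agree with each other?). Here is a simplified hand-rolled version run on toy chains rather than our fit; for the real thing, rstan’s summary() reports Rhat and n_eff for every parameter:

```r
set.seed(3)

# simplified split-R-hat: compare the two halves of a single chain
split_rhat <- function(draws) {
    half <- length(draws) %/% 2
    halves <- list(draws[1:half], draws[(half + 1):(2 * half)])
    W <- mean(sapply(halves, var))         # within-half variance
    B <- half * var(sapply(halves, mean))  # between-half variance
    sqrt(((half - 1) / half * W + B / half) / W)
}

well_mixed <- rnorm(1000)               # stationary draws
trending   <- cumsum(rnorm(1000, 0.1))  # a chain that is still drifting

split_rhat(well_mixed)  # close to 1
split_rhat(trending)    # far above 1: the halves disagree
```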

Not surprisingly, the slope and intercept show some negative correlation:

ggplot(coef_samples, aes(x = beta0, y = beta1)) + geom_hex() + scale_fill_viridis_c()

ggplot(coef_samples, aes(x = beta0, y = sigma)) + geom_hex() + scale_fill_viridis_c()

ggplot(coef_samples, aes(x = beta1, y = sigma)) + geom_hex() + scale_fill_viridis_c()

5. Baseline Model Posterior Predictive Checks

Now that we’ve fit a model, we can draw data from the posterior predictive distribution \(p(y^\prime|y)\). We can use this fake data in a couple of ways to build intuition for how our model is doing:

  • Directly compare samples from the PPD with real data
  • Compute summary statistics on the PPD samples and compare their range to the same statistic computed on the observed dataset.

In either case we can break those checks out by subsets of the data (like WHO super region) to see if the model describes data equally well in different regions, or whether we’ll tend to see geographic biases in any estimates we make from the model.

Let’s extract the posterior predictive distribution samples:

log_pm_ppd <-  model_samples %>% 
    spread_draws(log_pm_rep[city_id]) %>% 
    ungroup() %>% 
    select(-.chain, -.iteration)

log_pm_ppd
## # A tibble: 5,960,000 x 3
##    .draw city_id log_pm_rep
##    <int>   <int>      <dbl>
##  1     1       1       2.33
##  2     2       1       3.34
##  3     3       1       4.16
##  4     4       1       2.60
##  5     5       1       3.37
##  6     6       1       3.36
##  7     7       1       2.79
##  8     8       1       3.15
##  9     9       1       4.52
## 10    10       1       3.13
## # ... with 5,959,990 more rows

5.1 Direct Comparisons

Here’s a flipbook of scatter plots comparing PPD datasets with the real data. Again we sample 9 random draws.

log_pm_ppd9 <- log_pm_ppd %>%
    filter(.draw %in% sample(max(.draw), size = 9L)) %>%
    left_join(data %>% select(city_id, log_sat, log_pm25), by = "city_id") %>% 
    select(.draw, city_id, log_sat, observed = log_pm25, sampled = log_pm_rep)

log_pm_ppd9
## # A tibble: 26,820 x 5
##    .draw city_id log_sat observed sampled
##    <int>   <int>   <dbl>    <dbl>   <dbl>
##  1    56       1    3.51     3.51    3.13
##  2    84       1    3.51     3.51    4.49
##  3   261       1    3.51     3.51    3.52
##  4   395       1    3.51     3.51    3.65
##  5   636       1    3.51     3.51    3.65
##  6  1047       1    3.51     3.51    3.66
##  7  1234       1    3.51     3.51    4.04
##  8  1313       1    3.51     3.51    3.58
##  9  1502       1    3.51     3.51    3.38
## 10    56       2    2.29     2.71    3.38
## # ... with 26,810 more rows
log_pm_ppd9 %>% 
    gather("type", "log_pm25", observed:sampled) %>% 
    ggplot(aes(x = log_sat, y = log_pm25, color = type)) +
    geom_point(size = 0.5, alpha = 0.3) + 
    facet_wrap(~ .draw)

While they certainly overlap, there’s (as expected) some interesting structure in the real data that the PPD doesn’t capture. Breaking out by super region exacerbates the differences:

log_pm_ppd9 %>%  
    left_join(data %>% select(city_id, super_region_name), by = "city_id") %>% 
    gather("type", "log_pm_25", observed:sampled) %>% 
    ggplot(aes(x = log_sat, y = log_pm_25, color = type)) +
    geom_point(alpha = 0.1, size = 0.1) + 
    geom_smooth(method = "lm", se = FALSE) + 
    facet_wrap(~ super_region_name)

If we compare the histogram of observed \(PM_{2.5}\) values with samples from the PPD we see that the real data has some asymmetry that the synthetic data is missing:

log_pm_ppd9 %>% 
    gather("type", "log_pm_25", observed:sampled) %>% 
    ggplot(aes(x = log_pm_25, group = str_c(.draw, type))) +
    geom_freqpoly(aes(color = type), bins = 20)

5.2 Comparison with Summary Statistics

A danger with posterior predictive checking is that we’re double dipping with our data - using it once to fit the model and a second time to check it. When we’re comparing summary statistics, a poor choice of statistic could obscure a serious problem with the model.

Consider, for example, a model where one parameter controls the mean of the posterior distribution. Comparing means of samples from the PPD will probably look good (because that parameter can learn the mean of your data easily) even if the shape of the fit is terrible. It’s good practice, then, to select test statistics that aren’t directly related to a single model parameter.
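
A toy illustration of that danger (simulated data, not our PM dataset): normal replications can match a skewed dataset’s mean almost exactly while completely missing its shape.

```r
set.seed(7)

# skewed "observed" data, and normal "replications" matched to its moments
y     <- rexp(1e4, rate = 1)
y_rep <- rnorm(1e4, mean = mean(y), sd = sd(y))

mean(y) - mean(y_rep)  # tiny: a mean-based check would pass

# a shape-sensitive statistic (third central moment) catches the mismatch
c(observed   = mean((y - mean(y))^3),
  replicated = mean((y_rep - mean(y_rep))^3))
```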

In this case, since our histogram showed that the real data is much less symmetric than the posterior predictive distribution, let’s use skewness (a measure of distributional asymmetry):

skew <- function(x) {
    mu <- mean(x)
    numerator <- mean((x - mu)^3)
    denominator <- mean((x - mu)^2)^(3/2)
    numerator/denominator
}

observed_skewness <- skew(data$log_pm25)
observed_skewness
## [1] 0.5562795
log_pm_ppd %>% 
    group_by(.draw) %>% 
    summarize(sk = skew(log_pm_rep)) %>% 
    ggplot(aes(x = sk)) + 
    geom_histogram(bins = 30) +
    geom_vline(xintercept = observed_skewness, linetype = "dashed")

Breaking out a statistic (let’s try the median this time) by WHO super region, we see that not only do the samples fail to resemble the observed data, they’re wrong in different directions for different regions! If we were answering questions that compared estimates across geographical regions, this would be a first step toward getting really dumb answers.

ppd_median_by_region <- log_pm_ppd %>% 
    left_join(data %>% select(city_id, super_region_name), by = "city_id") %>% 
    group_by(super_region_name, .draw) %>% 
    summarize(med = median(log_pm_rep)) %>% 
    ungroup()

data_median_by_region <- data %>% 
    group_by(super_region_name) %>% 
    summarize(med = median(log_pm25))

ggplot() + 
    geom_histogram(aes(x = med), data = ppd_median_by_region, bins = 30) +
    geom_vline(aes(xintercept = med), linetype = "dashed", data = data_median_by_region) + 
    facet_wrap(~ super_region_name)

6. The undiscovered country (Your Turn)

What’s next? By this point we should all agree that this problem deserves a model more nuanced than simple linear regression. Build a better model and run some of the same tests to see how it performs!

  • One option: run separate regressions for different regions or clusters
  • Another option, in between “one big stupid regression” and “a ton of independent regressions”: build a hierarchical model that has separate parameters for each super region (or cluster), but treats those parameters as draws from a common distribution to be learned. This allows the different regions to use some information from each other while still preserving conditional independence. We’ll do a lot more with hierarchical models later!

To get you started, here’s a hierarchical model adapted from Gabry’s GitHub:

hier_file <- "../../stan/particulate_matter_multilevel.stan"
cat(read_file(hier_file))
// multilevel model for particulate matter
// non-centered parameterization of region effects
data {
  int<lower=1> N;                   // number of observations
  int<lower=1> R;                   // number of super regions
  int<lower=1,upper=R> region[N];   // region IDs
  vector[N] log_sat;                // log of satellite measurements
  vector[N] log_pm;                 // log of ground PM_2.5 measurements
}
parameters {
  real beta0;                       // global intercept
  real beta1;                       // global slope
  vector[R] beta0_region_raw;       // 'raw' region intercept offsets for NCP
  vector[R] beta1_region_raw;       // 'raw' region slope offsets for NCP
  real<lower=0> tau0;               // sd of beta0_region
  real<lower=0> tau1;               // sd of beta1_region
  real<lower=0> sigma;              // error sd for Gaussian likelihood
}
model {
  // mean of likelihood
  vector[N] mu;

  // priors
  beta0 ~ normal(0,1);
  beta1 ~ normal(1,1);
  tau0 ~ normal(0,1);
  tau1 ~ normal(0,1);
  beta0_region_raw ~ normal(0,1);
  beta1_region_raw ~ normal(0,1);
  sigma ~ normal(0,1);

  // construct mean of each observation
  // each region has a unique intercept and slope...
  // ...so we iterate through the observations and
  // lookup the betaX_region that matches the region
  // of observation i
  for (i in 1:N) {
    mu[i] = (beta0 + tau0 * beta0_region_raw[region[i]]) + (beta1 + tau1 * beta1_region_raw[region[i]]) * log_sat[i];
  }

  // likelihood
  log_pm ~ normal(mu, sigma);
}
generated quantities {
  vector[N] log_lik;    // pointwise log-likelihood for LOO
  vector[N] log_pm_rep; // replications from posterior predictive dist

  for (i in 1:N) {
    // what's the mean for observation i?
    real log_pm_hat_n;
    log_pm_hat_n  = (beta0 + tau0 * beta0_region_raw[region[i]]) + (beta1 + tau1 * beta1_region_raw[region[i]]) * log_sat[i];

    // now compute log likelihood and simulate from posterior
    log_lik[i] = normal_lpdf(log_pm[i] | log_pm_hat_n, sigma);
    log_pm_rep[i] = normal_rng(log_pm_hat_n, sigma);
  }
}

6.1 The lab

Compile and sample from the hierarchical model above. Perform the same posterior predictive checks we did for the previous examples, and contrast the results to determine whether this model is an improvement over the baseline.
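
One possible way to start (a sketch, assuming the data and hier_file objects from above; the names hier_data, hier_model, and hier_samples are my own, but the list element names must match the Stan data block, and as.integer(factor(...)) yields the contiguous 1..R region codes the model expects):

```r
# map super regions to integer IDs 1..R, as required by the region[] lookup
region_id <- as.integer(factor(data$super_region_name))

hier_data <- list(
    N       = nrow(data),
    R       = n_distinct(data$super_region_name),
    region  = region_id,
    log_sat = data$log_sat,
    log_pm  = data$log_pm25
)

# compile and sample, just as for the baseline model
hier_model   <- stan_model(hier_file)
hier_samples <- sampling(hier_model, data = hier_data, iter = 1000, chains = 4)
```

From here, spread_draws(hier_samples, log_pm_rep[city_id]) feeds directly into the same posterior predictive checks as before.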